ABSTRACT

[...] generation of high-performance parallel code is difficult and error-prone. It is even more difficult
[...] to be efficient on many different computers. Finally, it is harder still to satisfy [...] include
reliability and ease of use, required of commercial software intended for use in a [...] environment. As a
result, the application of parallel processing technology to commercial software has been [...] even though
there are numerous computationally demanding programs that would significantly [...] application of parallel
processing. This paper describes DSSLIB, [...] the time-consuming computations in engineering and scientific
software. DSSLIB combines the efficiency and speed of parallel computation with a serial programming model
that [...] of typical [...] code. The result is a [...] processing into commercial software without
compromising maintainability, reliability, portability, or ease of use [...] significant advantages over less
powerful non-parallel entries in the [...].




[...] is sufficiently difficult and time-consuming that it is rarely done. Even when developing [...] option,
it is common to use techniques that are specific to a [...] to other hardware or software architectures. As a
program becomes more closely tied to [...] environment it requires more extensive changes in order to adapt to
changes in the [...]. [...] result of writing parallel programs in a way that binds them to a specific hardware
or [...] the "dusty deck" codes of the future will be even more difficult to [...] we have to [...] today. A
predictable result of the pace of technology change is that codes written today will require more frequent
updates to adapt, customize, and optimize them for the [...] system. Therefore, we can expect the maintenance
phase to get even more expensive [...] present approach of requiring that programmers customize programs for a
specific environment in order to [...] speed. This paper first describes a method of implementing parallelism
that avoids many of the [...]. It then describes a library of parallel linear algebra subroutines [...].

[...] outlined below and expanded upon in the paragraphs [...]

1. [...]

2. [...]

3. Parallelization can slightly change the numerical properties of a given method.

4. Wide variations in the performance characteristics of different parallel and distributed architectures
make it difficult to write a single code that is efficient on a range of machine types.

5. Dependency on the run-time environment makes it difficult to write a single code that is efficient on a
single machine type under varying system loads.

Parallel algorithms, especially those that use medium- and coarse-grain parallelism, are almost intrinsically
[...]. For example, the order in which [...] cause significant changes in the internal behavior of a program.
It is possible that some, but [...] to semaphores or other global data will work properly. It is therefore
possible that a bug will go [...] with serial programs.

Virtually none of the available parallel systems report faults or events that take place on a parallel CPU. For
example, a division by zero or an overflow on a parallel CPU will generally go undetected by the host processor.
Virtually all of those systems that do report faults or events back to the host processor do so in a non[...]
manner. This can significantly complicate the task of debugging.

Parallelism may change some of the numeric properties of a code. There are many well-known examples of this,
but one example is that splitting a summation in different ways can generate different results.
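The summation effect can be shown with a minimal Python sketch. The function names and the four-element data set are illustrative assumptions, not taken from the paper; the point is only that floating point addition is not associative, so summing chunks separately can change the answer.

```python
def serial_sum(values):
    """Add the values left to right, as a serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, n_parts):
    """Sum contiguous chunks separately, then add the partial sums in order."""
    size = -(-len(values) // n_parts)  # ceiling division
    partials = [serial_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return serial_sum(partials)


data = [1e16, 1.0, -1e16, 1.0]
print(serial_sum(data))    # 1.0
print(split_sum(data, 2))  # 0.0
```

In the two-chunk case each 1.0 is absorbed by the large value in its own chunk, so the contribution that survives in the serial loop is lost.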

Parallel and distributed architectures are available for all classes of machines ranging from PCs to supercomputers.
Each of these architectures has widely varying performance characteristics. The parallelism on a given
computer system may be fine-, medium-, or coarse-grain parallelism, or it may be any combination of those three
models. Fine-grain parallelism can take the form of an instruction pipeline or independent functional units in a
single CPU. There may also be multiple processor types in a single computer, for example independent CPU and
I/O processors or a CPU and an FPU. Medium-grain parallelism is typically loop-level parallelism on several
processors with a shared memory. Coarse-grain parallelism can occur between any two processors regardless of
whether they share a common memory. All of this variation makes it difficult to design a code that will run well
on many or all of the architectures.

Finally, variations in the run-time environment can make it very difficult to write code that is efficient even on a
single machine type but under varying work loads. For example, distributing a computation across workstations
in a cluster can be done efficiently when the workstations are available and the network is lightly loaded but
not when the workstations are busy or the network is heavily loaded. Finally, changes in problem size can
significantly change the performance characteristics of a particular parallel algorithm.

BACKGROUND / EXISTING APPROACHES 

We start by considering the systems for parallel and distributed processing that are widely available today. They
appear to fall into one of three categories: remote procedure calls, subroutine libraries that provide parallelism
primitives, and pre-parallelized subroutine libraries. The systems that we consider as the representatives of each of
these categories are UNIX™ RPCs as implemented by Sun Microsystems [8], Parallel Virtual Machine, and
LAPACK [1].

RPCs are a mechanism in which a UNIX programmer can run a procedure on a remote machine using the simple
semantics of a procedure call. When invoked, a synchronous RPC transmits the arguments to a remote
machine which then executes the procedure and returns the results. In this way, RPCs provide distributed
processing. A layer called XDR tries to hide from the programmer machine-specific details about byte ordering,
word length, and so forth by doing some of the data conversion necessary to make the data in the arguments
understandable to the remote machine and to make the result from the remote machine understandable to the
host. An asynchronous RPC is similar to a synchronous RPC except that the host does not wait for a result from a
remote machine after initiating a remote procedure. After initiating a remote procedure on one machine, the
host may make one or more other asynchronous RPCs and in this way the host achieves parallel processing. RPCs are
supported by rpcgen, which allows a programmer to create RPC templates relatively easily. The strengths of the
RPC are its ease of invocation using standard procedure call semantics and its relatively easy portability among
UNIX operating environments.
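The difference between the two calling styles can be sketched in Python. This is not Sun RPC: `remote_square` is a hypothetical stand-in for a remote procedure, and a local thread pool stands in for the remote machines. Only the control flow (wait after each call versus initiate all calls and then collect) mirrors the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def remote_square(x):
    """Hypothetical stand-in for a procedure that would run on a remote machine."""
    return x * x


def synchronous_calls(args):
    # Each call blocks until its result returns: distributed, but not parallel.
    return [remote_square(a) for a in args]


def asynchronous_calls(args):
    # All calls are initiated before any result is awaited, so they may overlap.
    with ThreadPoolExecutor(max_workers=len(args)) as pool:
        futures = [pool.submit(remote_square, a) for a in args]
        return [f.result() for f in futures]


print(synchronous_calls([1, 2, 3]))   # [1, 4, 9]
print(asynchronous_calls([1, 2, 3]))  # [1, 4, 9]
```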


The standard form of RPC/XDR has many drawbacks. First, XDR has a clear bias towards C programs running on
32-bit machines with IEEE floating point arithmetic and it has poor support for data types that are not common on
this configuration. For example, FORTRAN's complex data type (a data type not available in C), double precision
floating point on a Cray (a 64-bit machine that does not use IEEE arithmetic), and BCD (often supported on IBM
mainframes and 80x87 math coprocessors) are poorly handled by the standard XDR. While there is some
portability among UNIX operating environments, there is essentially no hope of easily porting an RPC-based
program to a non-UNIX environment. Synchronous RPCs do not allow parallel processing and the asynchronous
RPCs that do allow parallel processing are almost hopelessly difficult to use. Synchronous RPCs are also less
portable than asynchronous RPCs. RPCs do not duplicate the machine state of the host machine on the remote
machine so that special processing options selected on the host will not operate correctly on the remote machine.
For example, if a program sets the IEEE rounding mode on the host then computations on the host will round
correctly and computations on remote machines will round incorrectly. Finally, RPCs have no fault tolerance. If a
temporary network glitch occurs or if a remote machine crashes while an RPC-based program is running then the
program will hang or crash if the user is lucky, or the program will return the wrong answer with not even a hint
of trouble if the user is unlucky.

Parallel Virtual Machine (PVM) is the representative of the class of parallel and distributed processing tools that
are characterized by giving the user direct access to parallel and distributed processing primitives such as send,
receive, initiate task, synchronize, and so forth. Other systems that fall into this category are Linda, Express, and
the tasking mechanism built into Ada. PVM was developed by Dr. Jack Dongarra and his team at Oak Ridge
National Laboratory (ORNL). It is a library of subroutines that gives a programmer close control over the
parallelism employed by an application. PVM is more portable than RPC because PVM is not tied to a specific
operating system. Dongarra and his team are considerably more scientifically oriented than the designers of RPC,
and PVM correctly handles data types from languages besides C and machines with configurations besides 32-bit
CPUs using IEEE arithmetic. PVM is designed to allow parallel processing in addition to simply the
distributed processing capability of synchronous RPCs. Parallel processing with PVM is much easier than with
asynchronous RPCs.


PVM is generally superior to RPC, but it has some drawbacks. From the perspective of a computer scientist, the
power of PVM comes largely from the degree of control that the programmer can exercise over the process of
parallelization. From the perspective of an atmospheric scientist, the problem with using PVM is the degree of
control that the programmer must exercise over the process of parallelization. Many of the messy details of
interprocessor communication that were concealed with RPCs are now the programmer's problem. Another
drawback to using PVM is that it requires that PVM-based programs be parallel or distributed programs.
PVM-based programs that are developed on a multiprocessor SPARCstation 10™ will run beautifully, in large part due to
the extremely fast interprocessor communication that comes with shared memory. PVM-based programs that are
run on a network of SPARCstation IPXs will run poorly, in large part due to the extremely slow interprocessor
communication that comes with the Ethernet connection. Regardless of the extreme variations in efficiency
between these two operating environments, PVM forces the program to behave in exactly the same way in both
environments. Finally, PVM is slightly better than RPC at fault tolerance, but not much. If a fault occurs in a
network or on a remote machine while a parallel computation is in progress, the application probably will fail.

LAPACK represents the approach of using parallel subroutine libraries. In contrast to PVM, whose subroutines
allow the user to define the operations involved in building a parallel application, LAPACK is a library of
subroutines that may be supplied to a user after being optimized and parallelized. LAPACK includes subroutines
to perform many of the common operations in computational linear algebra including solving systems of linear
equations, matrix factorizations, eigensystem solvers, SVD, and similar operations.
Much of LAPACK is built on block operations, meaning that it divides a data set into subblocks that can be
processed independently. It then does operations on those blocks. These blocks are then mapped for processing to 
the resources of a given machine. If there are multiple processors present then the blocks may be mapped to 
processors. The blocks may also be selected to correspond to the size of a cache for efficient memory access. The 
standard version of LAPACK as it comes from Oak Ridge National Laboratory is not optimized or parallelized, but 
the block structure does make it simpler to parallelize than other subroutine libraries that perform similar 
functions. LAPACK uses a subroutine called ILAENV to help it determine how each subroutine call should be 
blocked and so it is possible for ILAENV to react to changes in the environment and adapt its parallelism strategy 
accordingly. A major drawback to LAPACK with respect to its utility as a parallel programming environment is 
the same as its major strength, which is that the programmer has no concern with or control over the parallel 
processing. As a result, the programmer has no way to extend the parallel processing to get some capability that is 
not built into LAPACK. 
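The idea of block operations can be illustrated with a small Python sketch of a blocked matrix multiply. This is not LAPACK code: the function name and the `block` parameter are assumptions made for illustration, with `block` playing the tuning role that the text attributes to ILAENV.

```python
def blocked_matmul(a, b, block):
    """Multiply square matrices (lists of lists) one block x block tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            # Each (i0, j0) tile of C is independent of the others and could be
            # handed to a different processor or sized to fit a cache.
            for k0 in range(0, n, block):
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, n)):
                        s = c[i][j]
                        for k in range(k0, min(k0 + block, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(blocked_matmul(a, identity, block=2) == a)  # True
```

In this sketch the products for each element of C are still accumulated in increasing k order whatever the block size, so changing `block` changes only the traversal of tiles, not the arithmetic.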


A PROPOSED SOLUTION: DSSLIB 

We have developed a parallelization system named DSSLIB that will avoid many, though not all, of the pitfalls of
the available parallel programming systems. In particular, because it is a library, DSSLIB has the drawback
present in LAPACK that a user cannot extend it to perform computations that are not built in. DSSLIB is based
on a combination of software programs transferred from a variety of U.S. Government agencies and projects. Some
of the software has been in wide use since 1979 while others have been introduced as recently as 1991.

Specifically, DSSLIB includes version 1.1 of LAPACK and the latest versions of LINPACK [3] and levels 1, 2,
and 3 of the Basic Linear Algebra Subprograms (BLAS) [6, 5, 4]. We intend this system for use in production
codes, including commercial software; for users who are not sophisticated programmers of parallel or distributed
processing machines; and for any user, regardless of sophistication, who needs a significant speedup in an
application but does not have the resources to dedicate to a parallelization effort.

The choice of target users implies that the software must have at least the following characteristics: 

1. most or all of the parallelization must be automatic

2. runs correctly and reasonably efficiently in a variety of hardware configurations

3. complete fault tolerance

4. does not interfere with other software that may be in use, possibly including other parallelization
systems

5. compatible with all of the standard tools such as debuggers, profilers, etc.

6. requires no changes to move among many different hardware and software configurations; retains all of
the characteristics listed above even as it is being used in a variety of configurations

DSSLIB satisfies the criteria above by presenting to an application a serial programming model even when it is 
running in parallel. A serial programming model means that DSSLIB appears to an application to be a standard 
library running on a single CPU. Some of the implications of choosing a serial programming model are: 

1. for a given set of data, results will always be exactly the same regardless of how a particular
computation was parallelized on a specific run

2. standard tools such as debuggers and profilers continue to work in the same way that they always have

3. IEEE conditions are presented to an application in exactly the same way regardless of whether a
computation is performed serially or in parallel

4. DSSLIB always presents signals to an application in the same way every time

5. parallel machines or processors use exactly the same environment as the host machine

Given our choice of target users, one of the most important requirements is that all of the parallelization be
automatic. When an application calls one of the parallel subroutines, DSSLIB determines how many
processors to use and how to partition the data, and divides the work among the available processors. The number of
processors to use is computed based on the size of the computation, expected performance from the network,
expected performance from the other processors, and other factors. For a given computation, the number of
processors assigned may vary as an application runs due to changes in the factors that influence the number of
processors assigned. However, DSSLIB partitions all computations in a way that guarantees that for a given
computation the answer returned from DSSLIB will always be bitwise identical, regardless of the number of
processors.
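One way to obtain results that do not depend on the processor count is to make the partition a function of the problem alone and to combine partial results in a fixed order. The Python sketch below illustrates that idea for a sum; it is an assumption about how such a guarantee could be met, not a description of DSSLIB internals, and `BLOCK` and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # hypothetical fixed block length; depends on the problem, not the machine


def block_sum(block):
    """Add the values of one block left to right."""
    total = 0.0
    for v in block:
        total += v
    return total


def deterministic_sum(values, n_workers):
    """Sum values so that the result does not depend on n_workers."""
    blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in block order, whichever worker finishes first.
        partials = list(pool.map(block_sum, blocks))
    return block_sum(partials)


data = [0.1 * i for i in range(37)]
results = {deterministic_sum(data, w) for w in (1, 2, 3, 8)}
print(len(results))  # 1
```

Because the same blocks are always formed and their partial sums are always added in block order, every worker count performs exactly the same floating point operations.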

As part of doing the automatic parallelism, DSSLIB records certain performance information about each
computation and continuously tunes itself as a program runs. In addition to making the automatic parallelism
[...] interesting side-effect that large applications tend to get faster as they run because
[...] learn and adapt to the environment. For example, an application may learn that the
network is more lightly loaded than expected and that the cost of communicating with remote processors is less
[...]. As a result, it may choose to use more processors for computations than it would have
chosen by default. To illustrate this property, we modified the LINPACK 1000x1000 benchmark to solve six linear
systems and record six times instead of just one. In that test, DSSLIB solved the sixth linear system 18% faster
than it solved the first system because it was able to apply to the sixth computation things that it had learned about
the environment while doing the earlier computations.

The fact that parallelism is automatic allows DSSLIB to be incorporated into production code, even commercial
[...]. [...] is possible for a researcher to specify in advance a detailed and possibly very narrow description
of the types of problems to be solved, a commercial software package will be presented with a variety of problems
of varying sizes and shapes. The researcher writing a specialized code to solve a narrow set of problems may know
in advance a close estimate to the optimal number of processors, but a commercial package cannot have such
assumptions built in. The researcher may have available the luxury of being able to schedule a block of time on an
[...] system. [...] production code [...] be run in such an environment, but it also needs to work
[...] and adaptive parallelism in DSSLIB [...] of [...].

Of course, one of the characteristics of a serial programming model is that a program generally will not be
adversely affected by problems in the network or on other computers. To support this aspect of a serial
programming model, DSSLIB has been built completely fault tolerant. If one or more parallel machines crash due
to problems in hardware, software, or network then DSSLIB will detect the problem and automatically restructure
the computation so that it will complete correctly. Further, it will restructure the computation in such a way as to
guarantee that the answer from the restructured computation will be bitwise identical to the answer that would
have been computed [...] conditions. [...] failures on the host machine can hurt [...],
but this also is consistent with the serial programming model. As with the automatic parallelization above, the
compensation for faults or errors is automatic and does not require any special code on the part of the user.
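The restructuring idea can be sketched in Python as reassigning a block of work when its worker fails. The text does not describe how DSSLIB detects or recovers from failures, so everything here is an assumption: the workers are hypothetical callables, a failure is modelled as a `ConnectionError`, and the blocks are processed sequentially for simplicity.

```python
def run_blocks(blocks, workers):
    """Compute one result per block, reassigning a block if its worker fails."""
    results = []
    for index, block in enumerate(blocks):
        start = index % len(workers)
        candidates = workers[start:] + workers[:start]
        for worker in candidates:
            try:
                results.append(worker(block))
                break
            except ConnectionError:
                continue  # this worker is down; try the next one
        else:
            raise RuntimeError("no worker could process block %d" % index)
    return results


def healthy(block):
    return sum(block)


def crashed(block):
    raise ConnectionError("remote machine unavailable")


blocks = [[1, 2], [3, 4], [5, 6]]
print(run_blocks(blocks, [healthy, crashed]) == run_blocks(blocks, [healthy]))  # True
```

Since each block's result depends only on the block itself, recomputing it elsewhere yields the same values in the same positions as a run without failures.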

Errors or [...] such as IEEE conditions, are presented to an application in exactly the same way every
time, regardless of whether a computation is performed serially or in parallel. For example, if an application
attempts to solve a singular linear system then the subroutine DGESL will divide by zero. (This is standard
[...] LINPACK, not a DSSLIB [...].) [...] to a user's application. DSSLIB will return divide by zero or any
other condition to a user's application exactly as if it had been on a single CPU. Also, DSSLIB always presents
multiple signals to an application in the same order every time. Consider a computation that will be performed
in parallel in which one parallel machine will divide by zero and another will get an overflow. DSSLIB
guarantees that those signals will always be presented to the user's application in the same order every time,
just as they would be if the computation were performed on a single CPU. DSSLIB has no race conditions common
in other parallel systems.
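A repeatable signal order can be obtained by tagging each signal with the position of the operation in the serial program and delivering by that tag rather than by arrival time. The sketch below illustrates this; the tagging scheme and the function name are assumptions, not DSSLIB's documented mechanism.

```python
def deliver_in_serial_order(arrived):
    """Order signals by the position of the operation in the serial program.

    `arrived` holds (serial_position, signal_name) pairs in arrival order.
    """
    return [name for _, name in sorted(arrived)]


# Two possible arrival orders for the same parallel computation.
run_a = [(7, "overflow"), (2, "divide by zero")]
run_b = [(2, "divide by zero"), (7, "overflow")]
print(deliver_in_serial_order(run_a))  # ['divide by zero', 'overflow']
print(deliver_in_serial_order(run_a) == deliver_in_serial_order(run_b))  # True
```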

Parallel machines or processors use exactly the same environment as the host machine. For example, if an
application changes the IEEE rounding mode to round towards zero instead of round to nearest then all parallel
machines or processors will round towards zero. Other parallel processing systems do not ensure that parallel
computations are performed in the environment requested by an application.
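The same issue can be reproduced in miniature with Python's `decimal` module, used here only as an analogue for the IEEE rounding mode. Arithmetic contexts in `decimal` are per-thread, so a worker thread does not see the host's rounding choice unless it is passed along explicitly. The `divide` helper and the context values are illustrative assumptions.

```python
import decimal
from concurrent.futures import ThreadPoolExecutor


def divide(args):
    context, numerator, denominator = args
    # Install the host's arithmetic environment in this worker before computing.
    decimal.setcontext(context.copy())
    return decimal.Decimal(numerator) / decimal.Decimal(denominator)


host = decimal.Context(prec=3, rounding=decimal.ROUND_DOWN)  # round towards zero
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(divide, [(host, 2, 3), (host, -2, 3)]))
print(results)  # [Decimal('0.666'), Decimal('-0.666')]
```

Without the `setcontext` call the workers would use the default context (28 digits, round half even) instead of the host's three-digit, round-towards-zero setting.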
RESULTS


Of course, the acid test of any parallel or distributed system is speed. If it is not fast then none of its other 
characteristics are particularly interesting. It is even more helpful if the system is fast on real applications, rather 
than running well only on a selected set of benchmarks. DSSLIB is fast. Measurements on applications with run 
times ranging from 90 seconds to 12 hours show that it delivers a pleasing level of performance for a reasonable
variety of applications. Further, DSSLIB satisfies the requirement that it run well in a variety of hardware and 
software environments with no changes required of the user. 

The hardware configurations in which DSSLIB was tested for this paper include both parallel and distributed 
processing where processors are defined to be parallel if they share a common memory. The shared memory 
machine used for this paper was a dual processor SPARCstation 10 with a 40 MHz SuperSPARC processor. 
Processors are defined to be distributed if they do not share a common memory but are linked via some network 
such as Ethernet or FDDI. Distributed processing machines used for this paper were single processor 
SPARCstation IPXs linked by a network. Networks used were the standard Ethernet and the SBUS FDDI product 
from Network Peripherals. 

Both fine- and coarse-grain parallelism were tested. Fine-grain parallelism was done by making best use of the
parallelism between the floating point and integer CPUs, and also between the independent add and multiply units 
in the floating point unit. On the SS10, additional fine-grain parallelism was measured by taking advantage of the 
multiple instruction per cycle capability, though this was limited by the fact that floating point instructions launch 
one at a time. 

The software on which DSSLIB was tested included a matrix multiplication benchmark, a small image processing 
program written in IDL¹ that has a run time of 90 seconds, an artificial neural network written in FORTRAN 77
with run-times of 19 and 58 minutes for two data sets, and a discrete ordinate radiative transfer program [7] 
written in FORTRAN 77 with a run-time of 12 hours. 

We ran the matrix multiplication benchmark on a single-CPU SPARCstation 10 model 40, then again on a dual
processor machine. This benchmark simply generates 400x400 matrices and computes αAB + βC → C. The
comparison below shows the speed of the standard form of DGEMM from netlib and the speed of the same
subroutine from DSSLIB. The DSSLIB subroutine is timed on one and two processors where the single processor
run takes advantage of only fine-grain parallelism and the two processor run takes advantage of all levels of
parallelism.
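For reference, the operation that DGEMM performs (alpha times A times B, plus beta times C, stored back into C) can be written as a few lines of plain Python. This is an unoptimized sketch for small matrices stored as lists of lists, not the netlib or DSSLIB routine, and it returns a new matrix rather than overwriting C.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(2.0, a, b, 0.5, c))  # [[38.5, 44.5], [86.5, 100.5]]
```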



[Figure: DGEMM timing for the Original netlib routine, DSSLIB, and DSSLIB with 2 processors.]


¹ IDL is an interpreted data modeling language from Research Systems, Inc. It appears to a user to be nearly
identical in most ways to PV-WAVE from Visual Numerics, and the results reported for IDL are virtually identical 
to the results of similar experiments with PV-WAVE. 
The neural network tested was part of a NASA project to classify cloud formations extracted from satellite data.
The large Landsat data set was first reduced to a set of feature vectors extracted during a preprocessing phase and
these feature vectors were used as input to the neural network. The neural network then classified the cloud
formations in the image according to the information found in the feature vectors. The graph below shows the
results obtained with a small data set. This data set is sufficiently small that it runs on a single CPU and so uses
only fine-grain parallelism. The graph shows wall clock run time, so smaller values indicating shorter run times
are better.


[Figure: wall clock run time for the small data set; vertical axis from 0 to 1200.]


As one can clearly see from the graph above, the introduction of DSSLIB had a significant positive effect on the 
performance of the program, even for modest-sized data sets. Based on these results, Logar and Corwin decided to 
eliminate the data reduction in the preprocessing phase and send the raw satellite data directly to the neural 
network. In its present form, the neural network is limited by the size of the main memory on the Sun workstation 
rather than by the speed of the Sun CPU. The results below are for two Sun IPX workstations linked by an SBUS 
FDDI card from Network Peripherals. 
DISORT is an application used by NASA Goddard Space Flight Center to compute the thermal budget of
a two-dimensional multi-layer region of the Earth's atmosphere. Each individual horizontal layer is required to be
homogeneous, but different layers may have different characteristics. This program is dominated by eigenvalue
computations done with modified subroutines from EISPACK. It also uses a significant amount of CPU time on
LINPACK and various matrix algebra computations from the BLAS libraries. Because the EISPACK subroutines
had been modified, there is no safe way to replace them with LAPACK subroutines, which is what one would
usually do. However, it is possible to insert a few BLAS calls into the eigenvalue subroutines and other places in
the program. We did this in a way that provided some speed improvement, but did not compromise the accuracy or
portability of the program. The results are shown in the following graph.



[Figures: two bar charts, each comparing Original with DSSLIB.]

As one would expect from a program dominated by eigenvalue computation, parallelism provides significant
speedup, but the improvement is not a factor of two for two processors. Nevertheless, DSSLIB cut three hours
from a seven hour run using two processors. The customer is now able to make runs that would have been 
prohibitively expensive before these changes. 
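The figures quoted above can be turned into a speedup and a parallel efficiency with a short calculation. The sketch assumes that the seven-hour run is the baseline and that the two-processor run therefore took four hours; the text does not state which configuration produced the seven-hour time, and the function name is illustrative.

```python
def speedup_and_efficiency(time_before, time_after, n_processors):
    """Return (speedup, efficiency) for a run shortened by parallel execution."""
    speedup = time_before / time_after
    return speedup, speedup / n_processors


speedup, efficiency = speedup_and_efficiency(7.0, 7.0 - 3.0, 2)
print(speedup, efficiency)  # 1.75 0.875
```

Under that assumption the run is 1.75 times faster on two processors, which is consistent with the statement that the improvement is less than a factor of two.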



SUMMARY 

DSSLIB is a library built on a parallel processing system that presents to an application a serial programming 
model. This serial programming model simplifies the development of a parallel application because it hides from 
the programmer the difficult details of parallelization. It also allows DSSLIB to be incorporated into production or 
commercial software because it is able to adapt to the variety of configurations and environments in which such 
software is used. 
ABSTRACT


The high-speed data search system developed for KSC incorporates existing and emerging information
retrieval technology to help a user intelligently and rapidly locate information found in large textual databases.
This technology includes: natural language input; statistical ranking of retrieved information; an artificial
intelligence concept called semantics, where "surface level" knowledge found in text is used to improve the
ranking of retrieved information; and relevance feedback, where user judgments about viewed information are used to
automatically modify the search for further information. Semantics and relevance feedback are features of the
system which are not available commercially. The system further demonstrates a focus on paragraphs of information
to decide relevance; and it can be used (without modification) to intelligently search all kinds of document
collections, such as collections of legal documents, medical documents, news stories, patents, and so forth. The
purpose of this paper is to demonstrate the usefulness of statistical ranking, our semantic improvement, and relevance
feedback.

INTRODUCTION 

[...] amounts of natural language documents (text) is an important problem.
[...] searching press releases and numerous other documents to quickly answer media questions,
accessing bulky manuals and schematics compactly stored on a CD via a laptop computer, and retrieving digital
images by means of their catalog descriptions.

The primary intent of our work has been to provide convenient access to information contained in the numerous
and large public information documents maintained by Public Affairs at NASA Kennedy Space Center (KSC).
[...] KSC consist of press releases, and other [...] information
created at KSC, and other NASA offices using various word processors. There are also documents from outside
contractors, such as Rockwell, which produces the "NASA National Space Transportation System Reference" more
[...] the "shuttle manual." During a launch at KSC, [...] 3 NASA employees access these printed
[...] media questions. The planned document storage for NASA KSC Public Affairs is around
300,000 pages (approximately 900 megabytes of disk storage).

Current commercial text retrieval systems focus on the use of keywords to search for information. These
systems typically use a Boolean combination of keywords supplied by the user to retrieve documents. In general
[...] because they allow natural language input. These systems have been [...]
[...] this research technique into large operational systems has been very slow because, until
[...] there was no evidence that statistical ranking could be done in real-time on large document collections.
[...] United States which allow natural language input and perform [...]

The QA System incorporates two other features which are not available in any commercial text retrieval
system but have been shown to dramatically improve the statistical ranking of retrieved information. The first is
an artificial intelligence concept called semantics, where "surface level" knowledge found in text is used to improve
the ranking of retrieved information. The second is relevance feedback, where user judgments concerning viewed
information are used to automatically modify the search for more information.
The QA System is very close to being a commercial product. It has been used to participate in the first Text
Retrieval Conference (TREC-1) managed by the National Institute of Standards and Technology (NIST). Our 
participation in TREC-1 was funded by the Defense Advanced Research Projects Agency (DARPA). Participation 
in TREC-1 has enabled the QA System to be tested in an environment other than answering questions, and applied 
to databases other than aerospace text collections [3]. 

Conventional information retrieval using statistical ranking is demonstrated first in this paper. Demonstrations 
of improved statistical ranking due to the use of semantics within the QA System are then presented for comparison. 
This is followed by a demonstration of relevance feedback within the QA System. In all demonstrations, the focus
on paragraphs of information for retrieval will be evident. Finally, the issues of platforms and high-speed for the 
QA System are discussed in the Conclusion. 

CONVENTIONAL INFORMATION RETRIEVAL 

Finding relevant text and ranking the retrieved documents is not new and there are commercial systems which
already perform this activity; we mention here an example of ranked, relevant text retrieval. For a demonstration
to NASA KSC, the 1000 page shuttle manual was used by considering each paragraph of the manual as a document.
This resulted in a collection of 5143 documents. A commercial hypertext IR system called SPIRIT [11] was used
to automatically index the collection and provide natural language access. SPIRIT is a mainframe system. Running
on an IBM 4381, SPIRIT required three and one-half hours of clock time to index the collection of 5143 documents.

Figure 1 is a screen generated by SPIRIT for asking the natural language query 

What are the dimensions of the cargo area in the shuttle? 

Figure 2 is a screen generated by SPIRIT revealing a ranked list of 245 relevant documents with CLASS 1 being 
the most relevant. Figure 3 is a screen generated by SPIRIT revealing the first document in CLASS 6, which 
contains the answer to the query. This paragraph was found by reading the single paragraph in CLASS 1 first, then 
the single paragraph in CLASS 2, and so on until the answer was read in the tenth paragraph. 


NATURAL LANGUAGE QUERY ON THE SHUTTLE BASE 
<1>: What are the dimensions of the cargo area in the shuttle? 
EMPTY WORDS: What, are, the, of, the, in, the. 
KEYWORDS: dimensions, cargo, area, shuttle. 


Figure 1. Natural Language Query to the SPIRIT System. 


CLASSES    NB DOCS    KEYWORDS 

   1          1       dimensions, cargo, shuttle. 
   2          1       cargo, area, shuttle. 
   3          1       dimensions, area. 
   4          2       dimensions, shuttle. 
   5          4       cargo, area. 
   6         30       cargo, shuttle. 
   7         12       area, shuttle. 
   8          7       dimensions. 
   9         40       cargo. 
  10        147       area. 

BOTTOM OF LIST 




Figure 2. Document Classes Generated by the SPIRIT System. 
DOC 0005 BASE : doc 0005 NCP:0/CPI:1/NBI:1+18 1K/1K 
IDENTIFIER. : doc 0005 
TEXT. : 

The shuttle will transport cargo into near Earth orbit 100 to 217 nautical miles (115 - 250 
statute miles) above the Earth. This cargo (called payload) is carried in a bay 15 feet in 
diameter and 60 feet long. 

BOTTOM OF DOCUMENT 

INFORMATIONAL PAGE 1/1 


WHAT DO YOU WANT TO DISPLAY? 

> OR RETURN, <,»,«, DOC, END J>DQ,(?): 


Figure 3. Document Display by the SPIRIT System. 


Note that performance in this Question/Answer environment is measured by counting how many documents 
were examined to find the document containing the answer. This is not the usual way of measuring the performance 
of IR systems, but it is very appropriate for a Question/Answer environment. 
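
The measure can be illustrated with the class sizes listed in Figure 2. The following is a minimal sketch, not part of SPIRIT; the function name and the dictionary layout are assumptions made for illustration. It counts the documents examined when the classes are read in order, most relevant class first.

```python
# Class sizes copied from Figure 2 (class number -> number of documents).
class_sizes = {1: 1, 2: 1, 3: 1, 4: 2, 5: 4, 6: 30, 7: 12, 8: 7, 9: 40, 10: 147}

def documents_read(sizes, answer_class, position_in_class):
    """Count documents examined when classes are read in order, best class first."""
    earlier = sum(n for c, n in sizes.items() if c < answer_class)
    return earlier + position_in_class

total = sum(class_sizes.values())
# The answer is the first document of CLASS 6.
reads = documents_read(class_sizes, answer_class=6, position_in_class=1)
print(total, reads)
```

The total agrees with the 245 relevant documents reported for Figure 2, and the count of ten agrees with the answer being read in the tenth paragraph.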

The underlying principles and algorithms of automated IR systems like SPIRIT are well-known. Terms used 
as document identifiers are keywords modified by various techniques such as stop lists (removal of useless or empty 
words), stemming, synonyms, and query reformulation. Here, we present basic concepts associated with the 
calculation of weighting factors. 

The calculation of the weighting factor (w) for a term in a document is a combination of term frequency (tf), 
document frequency (df), and inverse document frequency (idf). The basic term definitions are as follows: 


$tf_{ij}$ = number of occurrences of term $T_j$ in document $D_i$ 
$df_j$ = number of documents in a collection which contain $T_j$ 

$idf_j = \log(N / df_j)$, where $N$ = total number of documents 
$w_{ij} = tf_{ij} \cdot idf_j$ 


When an IR system is used to query a collection of documents with t terms, the system computes a vector Q 
equal to $(w_{q1}, w_{q2}, \ldots, w_{qt})$ as the weights for each term in the query. The retrieval of a document with vector $D_i$ 
equal to $(d_{i1}, d_{i2}, \ldots, d_{it})$ representing the weights of each term in the document is based on the value of a similarity 
measure between the query vector and the document vector. A common similarity function which normalizes the 
similarity coefficient in case of different document sizes is the following: 


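$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj}\, d_{ij}}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} w_{qj}^{2}}} \qquad (1)$$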


It is important to note that the calculation of a similarity coefficient for each document and the ranking of the 
documents relevant to a query is rather time consuming. This is due to the summations that occur in the above 
formula and the fact that every document that has a term in common with a given query must be considered. The 
main problem with text retrieval using statistical ranking has been the time required to produce the document 
ranking given a query. Consequently, query response time has been typically slow. 
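
The weighting and ranking steps described above can be sketched in a few lines. This is an illustration only, assuming whitespace tokenization, a natural logarithm for idf (the base is not stated here), and a three-document toy collection invented for the example; it is not the code of SPIRIT or of the QA System.

```python
import math
from collections import Counter

def idf_table(docs):
    """idf for every term: log(N / df), with N the number of documents."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts once per term
    return {term: math.log(n / count) for term, count in df.items()}

def weight_vector(tokens, idf):
    """Sparse vector of tf * idf weights for one query or document."""
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}

def similarity(q, d):
    """Normalized inner product of two sparse weight vectors."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm = math.sqrt(sum(w * w for w in q.values()) * sum(w * w for w in d.values()))
    return dot / norm if norm > 0 else 0.0

# Hypothetical toy collection, one token list per "paragraph".
docs = [
    "cargo bay dimensions feet".split(),
    "cargo shuttle launch".split(),
    "crew training launch".split(),
]
idf = idf_table(docs)
query = weight_vector("dimensions cargo".split(), idf)
scores = [similarity(query, weight_vector(tokens, idf)) for tokens in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Every document sharing a term with the query contributes a summation, which is the cost discussed above; a practical system would restrict the loop to such documents by means of an inverted file.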

SEMANTIC APPROACH 

Although the basic statistical ranking approach (as demonstrated by SPIRIT) has shown some success in 
regard to natural language queries, it ignores some valuable information. We now know that these systems can be 
further improved by imposing a semantic data model upon the "surface level" knowledge found in text. 


Semantic Modeling 

Semantic modeling was an object of considerable database research in the late 1970's and early 1980's [1]. 
Essentially, the semantic modeling approach identified concepts useful in talking informally about the real world. 
These concepts included the two notions of entities (objects in the real world) and relationships among entities 
(actions in the real world). Both entities and relationships have properties. 
The properties of entities are often called attributes. There are basic or surface level attributes for entities in 
the real world. Examples of surface level entity attributes are Size, Color, and Position. These properties are 
prevalent in natural language. For example, consider the phrase "large, black book on the table," which indicates 
the Size, Color, and Position of a book. 

In linguistic research, the basic properties of relationships are discussed and called thematic roles. Thematic 
roles are also referred to in the literature as participant roles, semantic roles, and case roles. Examples of thematic 
roles are Beneficiary and Time. Thematic roles are prevalent in natural language; they reveal how sentence phrases 
and clauses are semantically related to the verbs in a sentence. For example, consider the phrase "purchased for 
Mary on Wednesday" which indicates who benefited from a purchase (Beneficiary) and when a purchase occurred 
(Time). 

Consider the following query: 

How long does the payload crew go through training before a launch? 

The basic statistical approach dismisses the following words in the query as empty: "how", "does", "the", "through", 
"before", and "a". Some of these words contain valuable semantic information. The following list indicates some 
of the thematic roles triggered by a few of the words in the above query: 

long => Duration, Time 
through => Location/Space, Motion With Reference To Direction, Time 
before => Location/Space, Time 

As another example, consider the query in Figure 1: 

What are the dimensions of the cargo area in the shuttle? 

The keyword "dimensions" indicates the attribute General Dimensions and the keyword "area" indicates both the 
thematic role Location/Space and the attribute General Dimensions. It would be reasonable to expect that the 
document that answers this query would have words in it that fall in the category of General Dimensions. 

The primary goal of the QA System has been to detect thematic and attribute information contained in natural 
language queries and documents. When the information is present, the system uses it to help find the most relevant 
paragraph to a query. In order to use this additional information, the basic underlying concept of text relevance 
was modified. The major modifications include the addition of a lexicon with thematic and attribute information, 
and a modified computation of the similarity measure given in (1). 

The Semantic Lexicon 

The QA System uses a thesaurus as a source of semantic categories (thematic and attribute information). For 
example, Roget’s Thesaurus contains a hierarchy of word classes to relate word senses [5]. For our research, we 
have selected several classes from this hierarchy to be used for semantic categories. We have defined thirty-six 
semantic categories as shown in Figure 4. 

In order to explain the assignment of semantic categories to a given term using Roget’s Thesaurus, consider 
the brief index quotation for the term "vapor": 

vapor 
    n.  fog               404.2 
        fume              401 
        illusion          519.1 
        spirit            43 
        steam             328.10 
        thing imagined    535.3 
    v.  be bombastic      601.6 
        bluster           911.3 
        boast             910.6 
        exhale            310.23 
        talk nonsense     547.5 
Thematic Role Categories    Attribute Categories 

Accompaniment               Color 
Amount                      External and Internal Dimensions 
Beneficiary                 Form 
Cause                       Gender 
Condition                   General Dimensions 
Comparison                  Linear Dimensions 
Conveyance                  Motion Conjoined with Force 
Degree                      Motion in General 
Destination                 Motion with Reference to Direction 
Duration                    Order 
Goal                        [illegible] 
Instrument                  Position 
Location/Space              State 
Manner                      Temperature 
Means                       Use 
Purpose                     Variation 
Range 
Result 
Source 
Time 

Figure 4. Thirty-Six Semantic Categories. 


The eleven different meanings of the term "vapor" are given in terms of a numerical category. We have developed 
a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in 
Figure 4. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute 
Temperature; and "exhale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven 
meanings associated with "vapor" do not trigger any thematic roles or attributes. Since there are eleven meanings 
associated with "vapor", we indicate in the lexicon a probability of 1/11 each time a category is triggered. Hence, 
a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. 
This technique of calculating probabilities is being used as a simple alternative to a corpus analysis. It should be 
pointed out that we are still experimenting with other ways of calculating probabilities. 
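
The lexicon entry for "vapor" can be worked through as follows. This is a small sketch; the list layout and the function name are assumptions for illustration, while the senses and category assignments are those described in the text.

```python
from collections import Counter
from fractions import Fraction

# One entry per thesaurus sense of "vapor"; None marks a sense that triggers no category.
VAPOR_SENSES = [
    ("fog", "State"),
    ("fume", "State"),
    ("illusion", None),
    ("spirit", None),
    ("steam", "Temperature"),
    ("thing imagined", None),
    ("be bombastic", None),
    ("bluster", None),
    ("boast", None),
    ("exhale", "Motion with Reference to Direction"),
    ("talk nonsense", None),
]

def category_probabilities(senses):
    """Each sense carries probability 1/len(senses); sum them per triggered category."""
    counts = Counter(category for _, category in senses if category is not None)
    return {category: Fraction(n, len(senses)) for category, n in counts.items()}

probs = category_probabilities(VAPOR_SENSES)
for category, p in probs.items():
    print(category, p)
```

The probabilities sum to 4/11 rather than 1, because seven of the eleven senses trigger no category.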

Extended Computation of the Similarity Measure 

The probabilistic details of a semantic lexicon and the computation of semantic weights can be found in [13]. 
The manner in which the QA System combines semantic weights and keyword weights can be found in 2 ^ 1 ]. 

Essentially we treat semantic categories like indexing terms, and the probabilities introduced by a semantic 
lexicon mean that the frequency of a category in a document becomes an expected frequency and the presence of 
a category in a document becomes a probability for the category being present. This means that the document 
frequency for a category becomes an expected document frequency, and this enables an inverse document frequency 
to be calculated for a category. 
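
As an illustration of these expected quantities, consider the sketch below. The toy lexicon, its probabilities, and the function names are hypothetical. The presence probability is computed under an independence assumption (one minus the probability that no word in the document triggers the category), which is our assumption for the example and not necessarily the computation used in the QA System.

```python
import math

# Hypothetical toy lexicon: word -> {semantic category: probability}.
LEXICON = {
    "vapor": {"State": 2 / 11, "Temperature": 1 / 11},
    "long": {"Linear Dimensions": 0.5, "Duration": 0.25},
    "feet": {"Linear Dimensions": 0.5},
}

def expected_frequency(tokens, category, lexicon):
    """Expected number of times the category is triggered in one document."""
    return sum(lexicon.get(t, {}).get(category, 0.0) for t in tokens)

def presence_probability(tokens, category, lexicon):
    """Probability that at least one token triggers the category (independence assumed)."""
    absent = 1.0
    for t in tokens:
        absent *= 1.0 - lexicon.get(t, {}).get(category, 0.0)
    return 1.0 - absent

def expected_idf(docs, category, lexicon):
    """idf computed from the expected document frequency of the category."""
    expected_df = sum(presence_probability(d, category, lexicon) for d in docs)
    return math.log(len(docs) / expected_df) if expected_df > 0 else 0.0

docs = [["bay", "feet", "long"], ["vapor", "steam"], ["crew", "training"]]
ef = expected_frequency(docs[0], "Linear Dimensions", LEXICON)
pp = presence_probability(docs[0], "Linear Dimensions", LEXICON)
idf = expected_idf(docs, "Linear Dimensions", LEXICON)
print(ef, pp, idf)
```

With these quantities in hand, a category can be weighted exactly like a keyword, as an expected frequency multiplied by an inverse document frequency.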

So the computation of a similarity coefficient as shown in (1) can be used, but now the summations in the 
formulas include semantic categories in the documents as well as terms in the documents. In other words, 


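$$\mathrm{sim}(Q, D_i) = \frac{T \sum_{j=1}^{t} w_{qj}\, d_{ij} + B \sum_{j=t+1}^{t+s} w_{qj}\, d_{ij}}{\sqrt{\sum_{j=1}^{t+s} d_{ij}^{2} \cdot \sum_{j=1}^{t+s} w_{qj}^{2}}} \qquad (2)$$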


where s = 36 is the number of semantic categories, and T and B are scaling factors for adjusting the blend. 
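
One possible reading of the blend is sketched below. It assumes that T scales the keyword part of the inner product, that B scales the semantic category part, and that both parts share a single normalization. The weights, the category label attached to "fast", and the two documents are hypothetical values chosen only to show the effect of B.

```python
import math

def blended_similarity(q_terms, d_terms, q_cats, d_cats, T=1.0, B=1.0):
    """Keyword inner product scaled by T plus category inner product scaled by B,
    normalized over keyword and category weights together (an assumed form)."""
    dot_terms = sum(w * d_terms.get(k, 0.0) for k, w in q_terms.items())
    dot_cats = sum(w * d_cats.get(k, 0.0) for k, w in q_cats.items())
    q_sq = sum(w * w for w in q_terms.values()) + sum(w * w for w in q_cats.values())
    d_sq = sum(w * w for w in d_terms.values()) + sum(w * w for w in d_cats.values())
    denom = math.sqrt(q_sq * d_sq)
    return (T * dot_terms + B * dot_cats) / denom if denom > 0 else 0.0

# Hypothetical weights: the query word "fast" is assumed to trigger a motion category.
q_terms = {"fast": 1.0, "orbiter": 1.0}
q_cats = {"Motion in General": 1.0}
doc_a = ({"orbiter": 1.0, "engines": 1.0}, {})                           # no semantic match
doc_b = ({"orbiter": 1.0, "velocity": 1.0}, {"Motion in General": 1.0})  # semantic match

keyword_only = [blended_similarity(q_terms, t, q_cats, c, T=1.0, B=0.0) for t, c in (doc_a, doc_b)]
blended = [blended_similarity(q_terms, t, q_cats, c, T=1.0, B=8.0) for t, c in (doc_a, doc_b)]
print(keyword_only, blended)
```

With B = 0 the document without semantic content ranks first; raising B moves the document that shares a semantic category with the query to the top.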
SEMANTIC IMPROVEMENT 


The QA System has demonstrated a noticeable semantic improvement using the similarity function in (2). 
Consider the same document collection and natural language query shown in the commercial system example of 
Figures 1, 2, and 3. Using the commercial system SPIRIT, ten paragraphs were read in order to find the answer 
to the following query: 


What are the dimensions of the cargo area in the shuttle? 

Considering the QA System, Figure 5 is a screen generated for asking this same natural language query. Figure 6 
is a screen generated by the QA System graphically showing to the user the importance of the keywords found in 
the query. Figure 7 is a screen generated by the QA System graphically showing to the user the importance of 
semantic information found in the query. Notice the "importance" of the semantic category General Dimensions 
in the screen shown in Figure 7. This long bar means that the semantic category General Dimensions is present in 
the query and there are very few documents retrieved (using keywords) having this type of semantic content. Hence, 
the importance of the category. 

Finally, Figure 8 is a screen generated by the QA System revealing the second paragraph found by proceeding 
through the ranked list of documents retrieved by the QA System for this query. The semantic information found 
in the query and displayed in Figure 7 is the reason the QA System ranked the answering paragraph second instead 
of tenth as did the SPIRIT system. Notice that the answering document in Figure 8 has several words in it which 
trigger the semantic category General Dimensions. We have considerable data of this kind, and several technical 
papers, which reveal a significant performance improvement due to semantic modeling in the NASA KSC 
Question/Answer environment. 

For another example of semantic improvement, consider the shuttle manual and the query: 

How fast does the orbiter travel on orbit? 

This query is interesting for two reasons. One is that the words "orbiter" and "orbit" are rather frequent words in 
the shuttle manual so lots of paragraphs are retrieved. The other reason is that the word "fast" is used for reference 
to velocity or speed. 

Figure 9 shows the number of paragraphs one must read to find a particular answering paragraph to this query 
for both a small and large collection of documents. In the small collection, the word "fast" does not occur at all 
and for the large collection, the word "fast" never occurs in an answering paragraph. Consequently, keyword-only 
statistical ranking is never very good. But by using semantics, the word "fast" causes a similarity to paragraphs using 
the words "velocity" or "speed". Consequently, semantics improves the statistical ranking of an answering paragraph. 
Different blends of keywords and semantics are shown using the similarity function in (2). 

RELEVANCE FEEDBACK 

It has been pointed out that conventional IR systems have a limited recall [6]; only a few relevant documents 
are retrieved in response to user queries if the search process is based solely on the initial query. This indicates a 
need to modify (or reformulate) the initial query in order to improve performance. It is customary to search the 
relevant documents iteratively as a sequence of partial search operations. The results of earlier searches can be 
used as feedback information to improve the results of later searches. One possible way to do this is to ask the 
user to make a relevance decision on a certain number of retrieved documents. Then this relevance information 
can be used to construct an improved query formulation and recalculate the similarities between documents and 
query in order to re-rank them. This process is known as relevance feedback [7,8,9,10] and it has been shown 
experimentally to improve the performance of the retrieval system. 

The basic assumption behind relevance feedback is that, for a given query, documents relevant to it should 
resemble each other in a sense that they have reasonably similar keyword vectors. This implies that if a retrieved 
document is identified as relevant, then the initial query can be modified to increase its similarity to such a relevant 
document. As a result of this reformulation, it is expected that more of the relevant documents and fewer of the 
nonrelevant documents will be extracted. 

The automatic construction of an improved query is actually straightforward, but it does increase the 
complexity of the user interface and the use of the retrieval system, and it can slow down query response time. Essentially, 
the terms and semantic categories for documents viewed as relevant to a query can be used to modify the weights 
of terms and semantic categories in the original query. A modification can also be made using documents viewed 
as not relevant to a query. Experimental results show a very promising improvement for relevance feedback within 
the QA System. 
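
A query reformulation of this kind can be sketched as below. The update shown is a Rocchio-style rule, a standard technique substituted here for illustration; the coefficients alpha, beta, and gamma and all weights are assumptions, and the code is not the QA System's own procedure. The same dictionaries may hold semantic categories alongside keywords.

```python
def reformulate(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: move the query toward documents judged relevant
    and away from documents judged not relevant."""
    new_query = {key: alpha * w for key, w in query.items()}
    for docs, signed_coef in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for key, w in doc.items():
                new_query[key] = new_query.get(key, 0.0) + signed_coef * w / len(docs)
    # Keep only positive weights so that penalized terms drop out of the query.
    return {key: w for key, w in new_query.items() if w > 0}

# Hypothetical weight vectors for a query and two viewed documents.
query = {"fast": 1.0, "orbiter": 1.0, "orbit": 1.0}
relevant = [{"orbiter": 1.0, "velocity": 1.0}]
nonrelevant = [{"ketch": 1.0, "ocean": 1.0}]
new_query = reformulate(query, relevant, nonrelevant)
print(new_query)
```

After the update the query contains "velocity", a word absent from the original query, so a paragraph that shares no keywords with the original query can now be retrieved.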
QA SYSTEM Prototype 4.0                Query System 

STSFIX 

QUERY INFORMATION 
Any word in YELLOW will be designated as a Useful Word, and any word in BLACK will be designated 

QUERY INPUT 

What are the dimensions of the cargo area in the shuttle? 

Suggestion: 

Describe what you want to know. 
For example - Velocity or speed of the shuttle on orbit. 
Use words you would expect to see. 
For example - The vab is 525 feet tall. 


Figure 5. Natural Language Query to the QA System. 

QA SYSTEM Prototype 4.0                Keyword Summary 

STSFIX 

[Bar chart of keyword importance, with columns IMPORTANCE and USE; the chart contents are not recoverable.] 

Press <F1> for help 
Press <ENTER> to accept changes 
Press <ESC> to go back 


Figure 6. Keyword Summary by the QA System. 



QA SYSTEM Prototype 4.0                Semantic Summary 

STSFIX 

Press <F1> for help 
Press <ENTER> to accept changes 
Press <ESC> to go back 



Figure 7. Semantic Summary by the QA System. 


Document 0005 


Page: 1 


RELEVANT DOCUMENT #2 


The shuttle will transport cargo into near Earth orbit 100 to 217 nautical miles (115 to 250 statute miles) above the 
Earth. This cargo (called payload) is carried in a bay 15 feet in diameter and 60 feet long. 


End of document 


Page Up, Page Down, Ctrl Page Up, Ctrl Page Down, Del, Esc 


Figure 8. Document Display by the QA System. 
                                            T-JB -1.10206         T -5-8.0 
                        Keywords            Blend of Keywords     Blend of Keywords 
                        Only                and Semantics         and Semantics 

First 26 pages of 
the shuttle manual         19                   4                     2 
(160 documents) 

The entire 
shuttle manual            145                 126                    14 
(5143 documents) 


Figure 9. Number of paragraphs read to find a particular answering paragraph for: 
How fast does the orbiter travel on orbit? 


Figure 10 provides an example using the first 26 pages of the shuttle manual and the query: 

How fast does the orbiter travel on orbit? 

Recall from Figure 9 that 19 paragraphs were read to find an answering paragraph. The document identifiers for 
these 19 paragraphs are shown in the left column of Figure 10 along with the notes that Document #13 and Document 
#16 were considered relevant to the original query, and Document #14 answered the query. All the other viewed 
documents were not relevant to the query. 

If relevance feedback is selected within the QA System and the system is told to display two documents and 
then reformulate the query, then the documents shown in the right column are viewed. Each document viewed 
must be tagged as relevant or not-relevant. Document #14 shows up earlier in the statistical ranking primarily 
because Document #13 was tagged as relevant to the original query. 

It is interesting to note that if one tags Document #14 (which answers the query) as relevant, then Document 
#87 is retrieved and it almost exactly answers the query. Document #87 would never be retrieved using just 
keywords without feedback because it has no keywords in common with the original query. Documents 13, 14, 
16, 69 and 87 are shown in Figure 11. The keywords that these documents have in common with the original query 
are underlined. Clearly, Document 69 is not relevant to the original query. 

CONCLUSION: PLATFORMS AND THE ISSUE OF HIGH SPEED 

Originally, the QA System was restricted to an IBM compatible PC platform running under the DOS operating 
system and without the use of any other licensed commercial software such as a DOS extender. The QA System 
is implemented in Borland C and one version uses B+ tree structures for the inverted files. We felt the speed of 
the system and its storage overhead was not efficient so a hashing scheme was added to eliminate the use of B+ 
trees and provide codes for keywords. We expected this second version to have improved indexing time, storage, 
and retrieval speed. 

Experiments revealed that indexing time of the QA System did not improve much. We were not surprised 
because the QA System is restricted under the PC DOS platform. This platform has a serious memory addressing 
restriction which results in memory page swapping and this seriously affects the speed of processing, especially 
during creation of the hashing table and index structures. The improvement in storage, however, was very 
impressive. It closely matches our objective, which is a storage ratio of indexes to text of around 
0.5. This is comparable to the ratio of very efficient retrieval systems using statistical ranking. 

Addressing the high speed issue, we now have the Borland C compiler for OS/2 so we expect to have a very 
high speed QA System running under OS/2 very soon. We are also in the process of converting the QA System 
to run in the UNIX environment. Figure 12 reveals achieved and projected run-time performances of the QA 
System on different operating system platforms. The DOS, B+ tree version of the system is shown in the upper 
left corner. Below (diagonally) are shown the OS/2, UNIX B+ tree and hashing versions of the QA System for 
different amounts of RAM. Indexing and typical query response times are shown for both a small (2.4 megabyte) 
and a large (1.2 gigabyte) document collection. Data for this chart was obtained in part from experiments performed 
for TREC-1 [3]. 
160 Documents 
Answer can be found in Document 14, 87 

        Keywording                      Relevance Feedback (view 2) 

Rank    Document                  Rank    Document    Relevant? 
  1       69                        1       69 
  2       13  Relevant              2       13        Yes 
  3       82                        3       82 
  4       15                        4      107 
  5      123                        5       85 
  6      106                        6      124 
  7       85                        7       16        Yes 
  8      124                        8       14        Yes, Answer 
  9       21                        9       87        Yes, Answer 
 10       23 
 11       24 
 12       83 
 13       31 
 14       26 
 15       16  Relevant 
 16       84 
 17       11 
 18       12 
 19       14  Answer 
  . 
  . 
never get 87 (no query words in 87) 





Figure 10. Relevance Feedback Improvement for the Query: 
How fast does the orbiter travel on orbit? 


Document 13 

The two orbital maneuvering system engines are used to place the orbiter on orbit, for major 
velocity maneuvers on orbit and to slow the orbiter for re-entry, called the deorbit maneuver. 
Normally, two orbital maneuvering system engine thrusting sequences are used to place the orbiter 
on orbit and only one thrusting sequence is used for deorbit. 

Document 14 

The orbiter's velocity on orbit is approximately 25,405 feet per second. The deorbit maneuver 
decreases this velocity approximately 300 feet per second for re-entry. 

Document 16 

For deorbit, the orbiter is rotated tailfirst in the direction of the velocity by the primary reaction 
control system engines. Then the orbital maneuvering system engines are used to decrease the 
orbiter's velocity. 


Document 69 

Atlantis (OV-104), after a two-masted ketch operated for the Woods Hole Oceanographic Institute 
from 1930-1966, which traveled more than half a million miles in ocean research. 

Document 87 

Entry interface is considered to occur at 400,000 feet altitude approximately 4,400 nautical miles 
(5,063 statute miles) from the landing site and at approximately 25,000 feet per second velocity. 


Figure 11. Documents 13, 14, 16, 69, and 87. Keywords in 
common with the original query are underlined. 
Figure 12. Run-Time Performance of the QA System. 